ABSTRACT

XML document markup is highly repetitive and therefore well com- 
pressible using dictionary-based methods such as DAGs or gram- 
mars. In the context of selectivity estimation, grammar-compressed 
trees were used before as synopsis for structural XPath queries. 
Here a fully-fledged index over such grammars is presented. The 
index allows to execute arbitrary tree algorithms with a slow-down 
that is comparable to the space improvement. More interestingly, 
certain algorithms execute much faster over the index (because no 
decompression occurs). E.g., for structural XPath count queries, 
evaluating over the index is faster than previous XPath implemen- 
tations, often by two orders of magnitude. The index also allows 
to serialize XML results (including texts) faster than previous sys- 
tems, by a factor of ca. 2-3. This is due to efficient copy han- 
dling of grammar repetitions, and because materialization is totally 
avoided. In order to compare with twig join implementations, we 
implemented a materializer which writes out pre-order numbers of 
result nodes, and show its competitiveness. 

1. INTRODUCTION 

An important task in XML processing is the evaluation of XPath 
queries. Such queries select nodes of an XML document and are 
used in many scenarios: embedded in larger XQueries, in XSL 
stylesheets, in XML policy specifications, in JavaScripts, etc. A 
common way of speeding up query evaluation is to use indexes. 
But conventional value indexes for XML tags and text values are 
not sufficient to answer XPath queries, because they do not capture 
the document's hierarchical structure. Therefore a large number of 
structural XML indexes have been introduced (see [12] for a recent 
overview). The first one was the DataGuide [13]. It stores a summary
of all distinct paths of a document. Later the finer 1-index was
proposed [25], which is based on node bisimulation. For certain XPath
queries these indexes allow evaluation without accessing the original
data, e.g., for structural queries restricted to the child and
descendant axes. More fine-grained structural indexes were considered
but turned out to be too large in practice, see [17]. As a compromise,
the A(k)-index [18] was proposed, which uses node bisimilarity of
paths up to length k; the D(k) [28] and M(k)-indexes [16] are
A(k)-variants that adapt to query workloads. Updates for the A(k) and
1-indexes were studied in [33]. Index path selection is considered,
e.g., in [29], but their indexes are usually larger than the original
documents (including data values). All indexes mentioned so far are
approximate for full structural XPath, i.e., they do not capture
enough information to evaluate XPath's twelve navigational axes. This
is in contrast to the indexes introduced here.

A self-index has the property that (1) it speeds up certain
accesses, and (2) it can reproduce the original data (which therefore
can be discarded after index construction). Moreover, such indexes are
often based on compression and hence are small (typically smaller than
the original data). For Claude and Navarro [7], a self-index for text
must (at least) efficiently support the extract and find operations;
these operations reproduce a portion of the text and find all
occurrences of a substring, respectively. In XPath processing, more
complex search than finding substrings is required. In fact, XPath
search is comparable to regular expression search. Unfortunately, even
for text, little is known about indexes that support arbitrary regular
expression search (see, e.g., [2]).

In [1, 24] it was observed that two particular navigational
operations allow drastic speed-ups for XPath evaluation: taggedDesc
and taggedFoll. Given a node and a label, these operations return the
first descendant node and the first following node with that label,
respectively. During XPath evaluation these operations make it
possible to jump to the next relevant nodes; this cuts down the number
of intermediate nodes to be considered during evaluation. The "QName
thread" in MTree [27] is similar in spirit (it allows jumping to the
next descendants with a given label).
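
To make the two operations concrete, the following minimal sketch implements them by a naive linear scan over a pre-order array. The array layout and the names flatten, tagged_desc, and tagged_foll are assumptions made for illustration; they are not the data structures of [1, 24], which answer these calls in constant time.

```python
def flatten(tree, labels=None, sizes=None):
    """Store a (label, children) tree as parallel pre-order arrays."""
    if labels is None:
        labels, sizes = [], []
    label, children = tree
    pos = len(labels)
    labels.append(label)
    sizes.append(0)
    for child in children:
        flatten(child, labels, sizes)
    sizes[pos] = len(labels) - pos  # number of nodes in the subtree at pos
    return labels, sizes


def tagged_desc(labels, sizes, node, label):
    """First descendant of node (in pre-order) carrying the given label."""
    for j in range(node + 1, node + sizes[node]):
        if labels[j] == label:
            return j
    return None


def tagged_foll(labels, sizes, node, label):
    """First node after the subtree of node carrying the given label."""
    for j in range(node + sizes[node], len(labels)):
        if labels[j] == label:
            return j
    return None


# Hypothetical toy document, pre-order positions 0..6.
doc = ("site", [("regions", [("item", []), ("item", [])]),
                ("people", [("person", [])]),
                ("item", [])])
labels, sizes = flatten(doc)
first_item_below_regions = tagged_desc(labels, sizes, 1, "item")
first_item_after_regions = tagged_foll(labels, sizes, 1, "item")
```

In this toy document the call for the descendant stays inside the subtree of "regions", while the call for the following node starts right after that subtree.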

The idea of our new index is to use grammar-compressed trees 
(which typically are much smaller than succinct trees [23]) and to 
add small data structures on top of these which support efficient 
taggedDesc and taggedFoll. Our contributions are 

1. a self-index for trees, based on grammar-based compression 

2. a generic sequential interface that allows to execute, over the 
new index, arbitrary algorithms on the original tree 

3. a special evaluator for counting of XPath query results, and 

4. special evaluators for serializing and materializing of XPath 
query results. 

We tested the generic interface of Point 2 on two algorithms: on
depth-first left-to-right (dflr) recursive and iterative full tree
traversals, and on the (recursive) XPath evaluator "SXSI" of [1]. We
obtain good time/space trade-offs. For instance, replacing SXSI's tree
store with our interface of Point 2 gives a slow-down of factor 4
while it reduces SXSI's memory use by a factor of 3 (averaged over the
16 tree queries on a 116M XMark file). Our experiments show that the
evaluators of Points 3 and 4 are faster than existing XPath
implementations, often by two orders of magnitude. Note that the
indexes used by these evaluators are so tiny in space (see Figure 1)
that any XML database can profit from them, by conveniently keeping
them in memory. This allows, among other things, fast serialization
and fast XPath selectivity computation, and therefore can replace
structural synopses.
Figure 1: TinyT Index Sizes (in MB)



While the generic interface causes a slow-down due to
decompression, there are classes of algorithms (over the grammar)
which allow considerable speed-ups. Essentially, the speed-ups are
proportional to the compression ratio, because the compressed grammar
need only be traversed once. For instance, tree automata and Core
XPath can be evaluated in one pass over straight-line tree (SLT)
grammars [20]. This idea was used in [11] for selectivity estimation
of structural XPath. They study synopsis size and accuracy, but do
not consider efficient evaluation. We combine the ideas of [1, 24]
with that of evaluating in one pass over the grammar. To this end, we
augment the grammar with information that allows efficient taggedDesc
and taggedFoll: for every nonterminal X and terminal symbol t of the
grammar a bit is stored that determines whether X generates t. If X
does not generate t, then it may be "jumped" during a taggedDesc call
for t. Our first structural index comprises this "jump table",
together with a compact representation of the grammar. The XPath count
evaluator of Point 3 executes over this index. To obtain grammars from
XML structure trees, we use the new TreeRePair compressor [21]. Due to
compression, the resulting indexes are phenomenally small. For
instance (cf. Figure 1), our index can store the half a billion nodes
of an 11GB XMark tree in only 8MB! This means an astonishing 8.7 nodes
per bit! Consequently, our evaluator (which, due to jumping, need not
even visit the whole grammar) is extremely fast. Compared to the
fastest known evaluators, MonetDB [4] and Qizx [32], we found that our
XPath count evaluator is faster by 1-2 orders of magnitude, for
essentially all queries we tested.

Motivated by our positive results from Point 3, the question arose 
whether the count evaluator can be extended to handle proper XPath 
semantics, i.e., to output XML subtrees of result nodes. Since se- 
rialization involves outputting of data values, all data values are 
now stored in a memory buffer. Additionally, a data structure is 
built that links the SLT grammar to the correct data values. This is 
achieved by storing for each nonterminal the number of text-values 
that it generates. In fact, since a nonterminal generates a tree
pattern which has many "dangling subtrees", we need to store tuples of
such numbers: the first component is the number of text-values in the
first "chunk" of the nonterminal, i.e., in the tag sequence (of the
nonterminal) before the first dangling subtree; next is the number of
text-values in the second chunk, i.e., between the first and second
dangling subtrees, etc. Evaluation is still done in one pass through
the grammar, but this time it must follow a strict dflr traversal
(which makes it slower than for count queries). A nice bonus is the
possibility to make clever use of hashing: we remember the "chunks" of
XML markup produced by each nonterminal. This greatly speeds up
serialization. Moreover, it turned out that materialization of result
nodes can be avoided entirely. Thus, neither do expensive grammar node
IDs need to be stored, nor is their translation to pre-order numbers
needed. Rather, whenever a result node is encountered during
evaluation, we start a serialization process which works in parallel
with evaluation. The resulting system outperforms by a factor of 2-3
the fastest known system SXSI (which on its own outperforms MonetDB
and Qizx, see [1]).

About the comparison: it can be argued that comparing our rudi- 
mentary XPath evaluator with full-blown XML databases is unfair, 
because these larger systems have more overhead (such as locking, 
transaction handling, updates). On the other hand, these systems 
are highly optimized and therefore could exploit their best avail- 
able algorithm for simple queries. We therefore believe that the 
comparison is relevant. Note that we also compare with special- 
ized implementations which handle smaller or incomparable XPath 
fragments. For instance, we compared to the fastest available
implementations of twig joins [14, 15]. Since these algorithms
materialize result nodes, we implemented an experimental materializer
(Point 4). Interestingly, it often outperforms these twig
implementations (which represent the state of the art of many years of
research on twigs). We also compare to the index of [9, 10], which
handles simple paths (XPaths with one // followed by /'s); our
experiments show that for selective queries this index is faster than
ours (by a factor of 10-20), while for non-selective queries our index
is faster.

Related Work. Compression by SLT grammars was used in [11] for
selectivity estimation of structural XPath. They study the space
efficiency of binary encoded grammars with respect to other XML
synopses, but do not study run times. It is also shown that updates
can be handled incrementally with little space overhead; this is
important also for our work, because we would like to support
incremental updates in the future. The minimal DAGs used by Koch et
al. can be seen as the first grammar-compressed approach to XML trees
(a DAG naturally corresponds to a regular tree grammar). For usual XML
document trees, minimal DAGs only exhibit 10% of the original number
of edges. More powerful grammar compressors such as BPLEX [6] further
reduce this number to 5%, and the recently introduced TreeRePair [21]
to less than 3%. An SLT grammar generalizes DAGs from sharing of
repeated subtrees to sharing of repeated tree patterns (connected
subgraphs of the tree). They are equivalent to the sharing graphs used
by Lamping for optimal lambda calculus evaluation [19]. A self-index
for grammar-compressed strings was presented in [7]. They show
efficient support for extract and find. It can be shown, but goes
beyond the scope of this paper, that the extract operation can be
generalized from their string grammars to our SLT grammars, with the
same time bounds as in their result. In [1] they use the succinct tree
data structures of [31] and add explicit copies for each label, using
compressed bit-arrays [26]. This allows constant time access to
taggedDesc and taggedFoll (using rank and select over bit-arrays), but
becomes fairly memory heavy (for a 116M XMark document with 6 million
nodes, they need 8MB for the tree, and an additional 18MB to support
taggedDesc and taggedFoll in constant time).

2. XML TREE COMPRESSION 

An XML document naturally corresponds to an unranked or- 
dered tree. For simplicity, we only focus on element nodes, at- 
tributes, and text values, and omit namespaces, processing instruc- 
tions, and comments. Our data model assumes that the attribute 
and text values are stored separately from the tree structure (in a 
"text collection"), and that they can be addressed by a function 
getText(n) that returns the n-th text or attribute value (in pre-order 
appearance). In our tree model, text nodes of the document are rep- 
resented by placeholder leaf nodes labeled by the special label _T. 
Similarly, attribute definitions are represented by "attribute
placeholder nodes" labeled _A; such a node has children nodes which
are labeled by the names of the attributes (prepended by the sym- 
bol "@") in their appearance order, which themselves have a single 
"attribute-text placeholder node" (labeled _AT). For instance the 
XML element <name id="9" r="4">Text</name> is represented, 
in term syntax, by this tree: name(_A(@id(_AT),@r(_AT)),_T). 
For a given XML document, such a tree is called its XML struc- 
ture tree. Obviously, these trees are larger than pure element- 
Figure 2: Datasets used in experiments
Figure 3: Sizes of XML structure trees



node trees, because of the additional placeholder nodes. The place- 
holder nodes help to get from a node in the structure tree to the 
corresponding value (by keeping track of how many placeholders have
appeared so far). Moreover, they make it possible to answer certain
queries directly on the structure index, e.g., the query //text(). To
get a rough estimate of the different node counts for element-only
trees and their corresponding XML structure trees, see Figures 2 and
3. The "Non-Text" numbers refer to the sizes of XML files in which all
text and attribute values were cut out (thus yielding non-valid XML).
If those values are replaced by our placeholder nodes, then we obtain
the XML structure trees whose sizes are shown in Figure 3 (their depth
changes at most by two, due to attribute placeholders). The XMark
files were generated with the XMark generator
(http://www.xml-benchmark.org), Sprot437M is the protein database
used in [5], and Treebank83M is a linguistic database obtained from
http://www.cs.washington.edu/research/xmldatasets.
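
The mapping from a document to its XML structure tree can be sketched with Python's standard xml.etree.ElementTree parser. This is only an illustration of the data model described above, not the MakeSTree program; the function names are hypothetical, and dropping whitespace-only text (so that indentation does not produce _T nodes) is an assumption.

```python
import xml.etree.ElementTree as ET


def structure_term(elem):
    """Return the structure tree of elem in term syntax."""
    parts = []
    if elem.attrib:
        # One _A node whose children are the attribute names, each with
        # a single _AT placeholder for the attribute value.
        attrs = ",".join("@%s(_AT)" % name for name in elem.attrib)
        parts.append("_A(%s)" % attrs)
    if elem.text and elem.text.strip():
        parts.append("_T")  # text before the first child element
    for child in elem:
        parts.append(structure_term(child))
        if child.tail and child.tail.strip():
            parts.append("_T")  # text following this child element
    if not parts:
        return elem.tag
    return "%s(%s)" % (elem.tag, ",".join(parts))


def structure_tree(xml_string):
    return structure_term(ET.fromstring(xml_string))


example = structure_tree('<name id="9" r="4">Text</name>')
```

For the element used as the running example, the sketch yields the term name(_A(@id(_AT),@r(_AT)),_T) given in the text.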

In our model, an XML structure tree is represented by a binary tree
which stores the first-child and next-sibling relationship of the XML
document in its first and second child, respectively. The idea of
grammar-based tree compression is to find a small tree grammar that
represents the given tree. For instance, the minimal unique DAG of a
tree can be obtained in amortized linear time (see, e.g., [5]); it can
be seen as a particular tree grammar (namely, a regular one). For
instance, the minimal DAG for the binary tree
$t = f(f(a(b,c), a(c,c)), a(c,c))$ can be written as this grammar:

$S \to f(f(a(b,B), A), A)$
$A \to a(B,B)$
$B \to c$

The size of a grammar is the total number of edges of the trees in
the right-hand sides of its productions. The grammar in the above
example has size 8. In contrast, the original tree has size 10. In our
grammars there is exactly one production for each nonterminal A.
The right-hand side of A's production is denoted by rhs(A). We fix
$\sigma$ as the size of the alphabet of a grammar, consisting of terminal
and nonterminal symbols.
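
The sizes quoted for the example can be checked with a small sketch that builds the minimal DAG by hash-consing, i.e., by giving identical subtrees the same identifier. The representation (nested tuples) and the function names are assumptions made for illustration; this is not the algorithm of [5].

```python
def tree_edges(tree):
    """Number of edges of a (label, children) tree."""
    label, children = tree
    return len(children) + sum(tree_edges(c) for c in children)


def minimal_dag(tree):
    """Return (root id, nodes), where nodes maps id -> (label, child ids)."""
    ids = {}    # (label, child ids) -> id of the unique shared node
    nodes = {}

    def visit(t):
        label, children = t
        key = (label, tuple(visit(c) for c in children))
        if key not in ids:
            ids[key] = len(ids)
            nodes[ids[key]] = key
        return ids[key]

    return visit(tree), nodes


def dag_edges(nodes):
    return sum(len(children) for _, children in nodes.values())


def leaf(x):
    return (x, ())


# t = f(f(a(b,c), a(c,c)), a(c,c)), the example tree of this section.
acc = ("a", (leaf("c"), leaf("c")))
t = ("f", (("f", (("a", (leaf("b"), leaf("c"))), acc)), acc))
root, nodes = minimal_dag(t)
```

Here tree_edges(t) is 10 and dag_edges(nodes) is 8, matching the sizes stated above for the tree and its DAG grammar.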



In an SLT grammar, sharing is not restricted to subtrees, but
arbitrary tree patterns (connected subgraphs) can be shared. In the
example tree t above, the tree pattern consisting of an f-node and
right subtree a(c,c) appears twice. As we can see, this tree pattern
has one "dangling edge", namely, to the first child of the f-node. In
SLT grammar notation, a tree pattern is written as a tree in which
special placeholders, called parameters, are inserted at dangling edge
positions. The parameters are denoted $y_1, y_2, \dots$ and are
numbered in the order of appearance of dangling edges. An SLT grammar
that represents t has these productions:

$S \to A(A(a(b,c)))$
$A(y_1) \to f(y_1, a(c,c))$

The nonterminal A uses one parameter $y_1$ to represent the single
dangling edge of the pattern mentioned above. The size of this grammar
is still 8. The number of parameters of a nonterminal A is called its
rank and is denoted rank(A). The maximal number of parameters of the
nonterminals of a grammar is called the rank of the grammar. Another
important aspect of a grammar is its depth, which is the length of the
longest sequence of nonterminals $A_1, A_2, \dots, A_d$ such that
$A_{i+1}$ appears in the right-hand side of $A_i$, for all
$1 \le i < d$. Since all our grammars produce one tree only, the depth
is bounded by the number of nonterminals. Given a nonterminal A (of
rank k), its pattern tree, denoted $t_A$, is the tree over terminal
symbols and parameters $y_1, \dots, y_k$, obtained from
$A(y_1, \dots, y_k)$ by applying grammar productions (until no
production can be applied anymore).
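
As a sanity check of the definitions, the sketch below expands the example SLT grammar back into the tree it represents and computes its size. The encoding is an assumption chosen for brevity: a node is a pair (symbol, children), and a parameter $y_i$ is written as the integer i.

```python
# S -> A(A(a(b,c))),  A(y1) -> f(y1, a(c,c))
GRAMMAR = {
    "S": ("A", [("A", [("a", [("b", []), ("c", [])])])]),
    "A": ("f", [1, ("a", [("c", []), ("c", [])])]),
}


def expand(grammar, node, args=()):
    """Apply productions until only terminal symbols remain."""
    if isinstance(node, int):  # parameter y_i: plug in the i-th argument
        return args[node - 1]
    symbol, children = node
    kids = [expand(grammar, c, args) for c in children]
    if symbol in grammar:      # nonterminal: substitute kids into its rhs
        return expand(grammar, grammar[symbol], kids)
    return (symbol, kids)


def to_term(tree):
    symbol, children = tree
    if not children:
        return symbol
    return "%s(%s)" % (symbol, ",".join(to_term(c) for c in children))


def grammar_size(grammar):
    """Total number of edges in all right-hand sides."""
    def edges(node):
        if isinstance(node, int):
            return 0
        return len(node[1]) + sum(edges(c) for c in node[1])
    return sum(edges(rhs) for rhs in grammar.values())


derived = to_term(expand(GRAMMAR, GRAMMAR["S"]))
```

The derived term is f(f(a(b,c),a(c,c)),a(c,c)), i.e., the tree t, and grammar_size(GRAMMAR) evaluates to 8; note that an edge leading to a parameter counts as an edge.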

While the minimal DAG of a tree is unique and can be found in
linear time, the minimal SLT grammar is not unique, and finding one is
NP-complete [6]. The BPLEX approximation algorithm [6] generates SLT
grammars that are ca. half the size of the minimal DAG. The new
TreeRePair algorithm [21] improves this by another 20%-30% (while
improving run time by a factor of about 30). The above example grammar
for t was produced by TreeRePair. Note that the rank of a grammar is
important, because it influences the run-time of algorithms that
directly execute on the grammar, such as executing tree automata or
Core XPath [20]. Both BPLEX and TreeRePair take a user specified
"maximal rank number m", and produce grammars of rank at most m.

3. STRUCTURAL SELF-INDEX 

We call our XML self-index "Tiny Tree" or simply "TinyT". The 
first layer of storage in TinyT consists of a small memory repre- 
sentation of the grammar. The second layer consists of additional 
mappings that support fast XPath evaluation. 

3.1 Base Index 

The base index consists of a small memory representation of the 
grammar. Start production right-hand sides are usually large trees 
(they represent the incompressible part of the XML structure tree) 
and are coded succinctly, using two alternative ways. All other pro- 
ductions are transformed into a normal form, so that each resulting 
production fits into a single 64-bit machine word. We experimented 
with two variants of representing the start rhs: 

(bp) the succinct trees of Sadakane and Navarro [31]

(ex) a naive custom representation. 

Both of these use $s \lceil \log \sigma \rceil$ bits to represent the
tag sequence of the tree, where s is the number of nodes in the start
rhs. The first one uses the "moderate size" trees of [31], requiring
an additional $2s + O(s/\mathrm{polylog}(s))$ bits of space. Our
implementation of (bp) uses approximately 2.5 bits per node. The
second one (ex) stores an explicit mapping called "find-close" which
for every node records the number of nodes in its subtree. This is
sufficient for our XPath evaluators because they only need pre-order
access to the grammar, plus the ability to "skip" a subtree. The
find-close table allows skipping a subtree by simply moving ahead in
the tag list by the number of nodes specified in the table. It
requires $s \lceil \log s \rceil$ bits. Clearly, this is rather
wasteful compared to (bp), see the third column in Figure 5, but can
make a large speed difference: e.g., our XPath count evaluator for the
query Q06 = /site/regions/*/item over XMark1G takes 3.5ms with (ex)
and 4.7ms with (bp). Observe also the difference in loading time of
the two variants shown in Figure 6.

We bring the remaining productions into binary Chomsky Normal Form
(bCNF). A production is in bCNF if it contains exactly two
non-parameter nodes in its right-hand side. The bCNF can be obtained
following exactly the same procedure as for ordinary CNF, see [22]. A
grammar is in bCNF if every production except the start production is
in bCNF. For instance, in our example SLT grammar from before, the
A-production is not in bCNF. We first change its right-hand side to
$f(y_1, B)$ (which is in bCNF) and add the new production
$B \to a(c,c)$. The latter is not in bCNF and therefore is changed to
$B \to C(c)$, together with the new production $C(y_1) \to a(y_1, c)$.
The final grammar, called $G_1$, is:

$S \to A(A(a(b,c)))$
$A(y_1) \to f(y_1, B)$
$B \to C(c)$
$C(y_1) \to a(y_1, c)$

Note that the size of this grammar is 9, thus it has grown by one. In
general, the size of a grammar can grow by a factor r, where r is the
rank of the original grammar. The rank of the grammar can grow by
max(r, 1), and the number of nonterminals can become at most two times
the size of the original grammar, as implied by Proposition 3 of [22].
If we transform the DAG grammar for t from before into bCNF, then a
grammar of rank one and of size 9 is obtained; consider
$t' = f(t, a(c,c))$, then the minimal DAG in bCNF is of size 11
(because two edges are added in the start rhs), while our SLT grammar
has size 10 (we simply add another A-node in the start production). In
practice, we do not observe a large size increase; the largest was
79%, see the last column of Figure 4. Depth increase can be large, as
shown in the figure. The rhs of each bCNF rule is of the form
$X(y_1, \dots, y_{i-1}, Y(y_i, \dots, y_j), y_{j+1}, \dots, y_r)$ and
thus is characterized by the triple $(X, i, Y)$, where X and Y are
nonterminals or terminals, and i is a number between 1 and the rank of
X. We represent one bCNF rule by a single 64-bit machine word, using
28 bits per nonterminal, 4 bits for the number i, and 4 bits for the
rank of the nonterminal. Our experiments show that setting the maximal
rank of BPLEX and TreeRePair to 8 and 2, respectively, gave the best
results for our XPath evaluators over the corresponding indexes. Thus,
limiting our memory representation to grammars of rank at most 15 (4
bits) is justified. We are now ready to calculate the space
requirement of the grammar representation:
#CNF-rules · 8 bytes + space(start-rhs).

Figure 4: Impact of bCNF

As an example, for XMark116M we calculate, according to Figure 4:
39631 productions in bCNF, multiplied by 8 bytes, equals 309.6KB. For
the tag sequence of the start rhs we need
$88299 \cdot \lceil \log(39631 + 89) \rceil$ bits $= 172.5$KB (there
are 89 different labels for XMark). For (bp) our implementation uses
27KB, while (ex) uses 183.2KB. Thus, the total sizes for (bp) and (ex)
are 509KB and 665KB, respectively (the sum of the first three columns
in Figure 5, up to rounding).
Figure 5: Sizes of TinyT's components (in KB)
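
The 64-bit rule encoding and the space calculation for XMark116M can be reproduced with the following sketch. The text only fixes the field widths (two 28-bit symbols, 4 bits for i, 4 bits for the rank); the order of the fields inside the word, the function names, and the use of exactly 2.5 bits per node for (bp) are assumptions.

```python
import math


def pack_rule(x, i, y, rank):
    """Pack the triple (X, i, Y) and a rank into one 64-bit integer."""
    if not (0 <= x < 2**28 and 0 <= y < 2**28):
        raise ValueError("symbol ids must fit into 28 bits")
    if not (0 <= i < 2**4 and 0 <= rank < 2**4):
        raise ValueError("i and rank must fit into 4 bits")
    return (x << 36) | (y << 8) | (i << 4) | rank


def unpack_rule(word):
    """Inverse of pack_rule, returning (x, i, y, rank)."""
    return ((word >> 36) & 0xFFFFFFF, (word >> 4) & 0xF,
            (word >> 8) & 0xFFFFFFF, word & 0xF)


def base_index_kb(rules, start_nodes, labels, variant):
    """Sizes in KB of the rule array, the tag sequence, and the start tree."""
    bits_per_tag = math.ceil(math.log2(rules + labels))
    rule_kb = rules * 8 / 1024
    tag_kb = start_nodes * bits_per_tag / 8 / 1024
    if variant == "ex":   # find-close table: ceil(log s) bits per node
        tree_kb = start_nodes * math.ceil(math.log2(start_nodes)) / 8 / 1024
    else:                 # (bp): roughly 2.5 bits per node
        tree_kb = start_nodes * 2.5 / 8 / 1024
    return rule_kb, tag_kb, tree_kb


# Numbers for XMark116M as quoted in the text.
ex_parts = base_index_kb(39631, 88299, 89, "ex")
bp_parts = base_index_kb(39631, 88299, 89, "bp")
```

Summing ex_parts and bp_parts gives about 665KB and 509KB, respectively, in line with the totals stated above.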



3.2 Auxiliary Indexes 

There are two well-known principles of XPath optimization: (1) 
jumping and (2) skipping. Here, jump means to omit internal nodes 
of the document tree. In our setting, the "jumped" nodes will be 
those represented by a nonterminal of our grammar. We also say 
that the nonterminal is "jumped". Note that after a jump we still 
need to continue evaluating in the subtrees below the jumped pat- 
tern. Skipping means to omit a complete subtree. Thus, no evalu- 
ation is needed below skipped nodes. We now introduce the jump 
table which allows to jump nonterminals; this table suffices for our 
XPath count evaluator. To jump or skip during serialization and ma- 
terialization we need further tables (the "pre and text mappings"). 
Lastly, we mention the start-skip table which supports fast skipping 
of subtrees. 

Jump Table 

As mentioned in the Introduction, it was observed in [1, 24] that
the two operations taggedDesc and taggedFoll allow drastic speed-ups
for XPath evaluation. The SXSI system [24] keeps a large data
structure (about 2.25 times larger than the rest of their tree store)
in order to give constant time access to these operations. We try to
add very little extra space to our (so far tiny) index, and still be
able to profit from these functions. We build a "jump table" which
keeps for every nonterminal X and every terminal symbol b a bit
indicating whether or not X generates a b-labeled node. When executing
a taggedDesc call (with label b), we try to derive the first
descendant node with label b; if a nonterminal during this derivation
does not generate b's (according to our jump table), then we do not
expand it, but "jump" it (by moving to its first parameter position).
The taggedFoll function is realized similarly. For the sequential
interface (Point 2 in the Introduction) plugged into SXSI [1], our
experiments show that the speed-up through taggedDesc/taggedFoll is
comparable to the speed-up obtained in SXSI. This is surprising,
because the space overhead for our jump table is small: 65% of extra
space, compared to the 225% in SXSI.

The jump table is not only useful to realize taggedDesc and 
taggedFoll, but also allows speed-ups in all our XPath evaluators, 
see, e.g., q1 in Figure [TS]. The size of the jump table (in bits) is
the number of nonterminals multiplied by the number of different
(terminal) labels. For instance, XMark uses 89 labels; thus, the jump
table for XMark116M is 39631 * 89 bits = 431KB. For Treebank83M, which
has 257 labels, we obtain 37540 * 257 bits = 1177.7KB, see the fourth
column in Figure 5. In fact, our XPath count evaluator only loads the
base index plus the jump table, which implies the total index sizes
shown in Figure 1 as the sum of the first four columns in Figure 5
(taking "ex").
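
A minimal sketch of how such a jump table could be computed is given below, using the small bCNF grammar from Section 3.1. Sets of labels stand in for the bit rows of the real table; the grammar encoding (a node is a pair (symbol, children), a parameter $y_i$ is the integer i) and the function names are assumptions.

```python
# S -> A(A(a(b,c))), A(y1) -> f(y1,B), B -> C(c), C(y1) -> a(y1,c)
G1 = {
    "S": ("A", [("A", [("a", [("b", []), ("c", [])])])]),
    "A": ("f", [1, ("B", [])]),
    "B": ("C", [("c", [])]),
    "C": ("a", [1, ("c", [])]),
}


def jump_table(grammar):
    """Map every nonterminal to the set of terminal labels it generates."""
    memo = {}

    def labels_of(nt):
        if nt not in memo:
            found = set()
            stack = [grammar[nt]]
            while stack:
                node = stack.pop()
                if isinstance(node, int):   # parameters generate nothing
                    continue
                symbol, children = node
                if symbol in grammar:       # reuse the row of a nonterminal
                    found |= labels_of(symbol)
                else:
                    found.add(symbol)
                stack.extend(children)
            memo[nt] = frozenset(found)
        return memo[nt]

    return {nt: labels_of(nt) for nt in grammar}


def jump_table_kb(nonterminals, labels):
    """Size in KB when one bit is stored per (nonterminal, label) pair."""
    return nonterminals * labels / 8 / 1024


table = jump_table(G1)
```

In this example no nonterminal other than S generates b, so a taggedDesc call for b may jump every occurrence of A, B, and C. The helper jump_table_kb(39631, 89) gives about 431KB, the figure quoted for XMark116M.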



Pre and Text Mappings 

In order to be able to materialize pre-order node numbers, or to
access the text collection (needed for serialization), we need to
calculate, during evaluation, the numbers of nodes/texts that have
appeared until the current node following a dflr traversal. However,
if we "jump" a nonterminal using our jump table, then we do not see
its terminals. Therefore we need another table which records for each
nonterminal the number of element nodes that it generates, and
similarly for the number of text nodes. In fact, the situation is more
complicated: we actually need to store several numbers per
nonterminal, as many as the rank of that nonterminal, plus one. With
respect to evaluation in dflr order, jumping a nonterminal means to
move to its first parameter position and continue evaluation there.
Thus, we must know how many element symbols are on the path from
$t_X$'s root to the first parameter, where $t_X$ is the tree generated
by X; note that $t_X$ contains exactly one occurrence of each
parameter of X. Similarly, once returned from X's first parameter
position, we will want to jump to the second parameter. We thus need
to know the number of element nodes that are on the path between $y_1$
and $y_2$ in the tree $t_X$. The size of the corresponding table
"prMap" is
$\sum_{X \in N} (\mathrm{rank}(X) + 1) \cdot \lceil \log k \rceil$
bits, where the sum ranges over all nonterminals X and k is the
maximal number of element nodes on such paths, for any nonterminal. In
fact, in our implementation we simply use a 4-byte integer per value.
For instance, our grammar for XMark116M has 14057 nonterminals of rank
zero, 14475 of rank one, 9311 of rank two, and 1786 of rank three. The
size of the resulting prMap is (14057 + 14475*2 + 9311*3 + 1786*4 =
78084) * 4B = 305KB; this explains the column "prMap" in Figure 5. The
corresponding table with numbers of text nodes is called the "textMap"
table.
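
The tuples stored per nonterminal can be illustrated as follows. The sketch reads the entries as "chunk" counts in dflr order, as described in the Introduction: the number of counted symbols before $y_1$, between consecutive parameters, and after the last parameter. That reading, the grammar encoding (a parameter $y_i$ is the integer i), and the function names are assumptions.

```python
# S -> A(A(a(b,c))), A(y1) -> f(y1,B), B -> C(c), C(y1) -> a(y1,c)
G1 = {
    "S": ("A", [("A", [("a", [("b", []), ("c", [])])])]),
    "A": ("f", [1, ("B", [])]),
    "B": ("C", [("c", [])]),
    "C": ("a", [1, ("c", [])]),
}


def chunk_counts(grammar, counted=lambda label: True):
    """Map each nonterminal of rank r to a tuple of r + 1 chunk counts."""
    memo = {}

    def chunks_of(nt):
        if nt not in memo:
            chunks = [0]
            visit(grammar[nt], chunks)
            memo[nt] = tuple(chunks)
        return memo[nt]

    def visit(node, chunks):
        if isinstance(node, int):   # parameter: the next chunk starts here
            chunks.append(0)
            return
        symbol, children = node
        if symbol in grammar:       # interleave the callee's chunks
            inner = chunks_of(symbol)
            chunks[-1] += inner[0]
            for child, after in zip(children, inner[1:]):
                visit(child, chunks)
                chunks[-1] += after
        else:
            if counted(symbol):
                chunks[-1] += 1
            for child in children:
                visit(child, chunks)

    return {nt: chunks_of(nt) for nt in grammar}


def prmap_kb(rank_histogram, bytes_per_value=4):
    """Table size in KB, given a mapping rank -> number of nonterminals."""
    values = sum((rank + 1) * n for rank, n in rank_histogram.items())
    return values * bytes_per_value / 1024


pr_map = chunk_counts(G1)
c_map = chunk_counts(G1, counted=lambda label: label == "c")
```

For $G_1$ the start nonterminal gets the single count 11 (all nodes of t), and A gets the pair (1, 3): one symbol before its parameter and three after it. Passing a different predicate, e.g. one that accepts only _T, would give the analogous counts for a textMap. With the rank histogram quoted above, prmap_kb returns about 305KB.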

Start-Skip Table 

If, during materializing or serializing, we want to skip a subtree,
then we still need to traverse that subtree of the grammar in order
to sum all numbers of element nodes/texts, respectively (using the
pre and text mappings). To short-cut this calculation, the start-skip
table is added. It stores for every node of the start rhs the total
number of element nodes/texts in its subtree. The size of this table
is the number of nodes in the start rhs multiplied by
$\lceil \log n \rceil$, where n is the total number of element
nodes/texts. In our implementation we simply use 4 bytes per such
number. The corresponding table for the numbers of text nodes is
called "textSSkip".
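
A start-skip table of this kind could be filled in one pre-order pass over the start rhs, as in the sketch below. It assumes that the number of nodes generated by each nonterminal is already known (for instance as the sum of its prMap entries); the dictionary generated and the function name are hypothetical.

```python
def start_skip(start_rhs, generated):
    """Per pre-order position of the start rhs: nodes in the derived subtree.

    start_rhs is a (symbol, children) tree; generated maps a nonterminal
    to the number of nodes in its pattern tree. Symbols that are missing
    from generated are terminals and count as one node.
    """
    table = []

    def visit(node):
        symbol, children = node
        pos = len(table)
        table.append(0)
        total = generated.get(symbol, 1)
        for child in children:
            total += visit(child)
        table[pos] = total
        return total

    visit(start_rhs)
    return table


# Start rhs A(A(a(b,c))) of the example grammar; A generates f, a, c, c.
rhs = ("A", [("A", [("a", [("b", []), ("c", [])])])])
skip = start_skip(rhs, {"A": 4})
```

The resulting table is [11, 7, 3, 1, 1]: skipping the subtree at the second A-node, for example, advances the node counter by 7 without touching the grammar.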

3.3 Index Generation 

The generation of the base index consists of the following steps 

(1) generate XML structure tree (MakeSTree), 

(2) compress via TreeRePair, 

(3) transform into bCNF, and 

(4) build in-memory representation of TinyT components and 
save to file (BuildTinyT). 

Technically speaking, Steps (1) and (3) are not necessary but can
be incorporated into the TreeRePair compressor. In Step (1) we merely
replace all text and attribute values by placeholder nodes. This can
be incorporated into the parsing process of TreeRePair. Similarly,
TreeRePair can be changed so that it produces grammars that are
already in bCNF. Since we also wanted to experiment with other
compressors such as DAG and BPLEX, we implemented small programs for
(1) and (3). Our program for (1) is a naive Java implementation using
SAX which is quite inefficient. Therefore the times for MakeSTree in
Figure 6 should be ignored and the table should be read as: indexing
time is dominated by grammar compression time. The times in Figure 6
for Step (4) are for generating the base plus the jump index, i.e.,
the first four columns of Figure 5. The time for generating the two
additional tables prMap and SSkip (columns 5 and 6 in Figure 5) is
negligible, as it is proportional to a "chunk-wise" traversal of the
grammar (see Sections 4.3 and 6.1).
Figure 6: Times (min:sec) for index generation and loading



4. THE THREE VIEWS OF A GRAMMAR 

An SLT grammar can be seen as a factorization of a tree into 
its (repeated) tree patterns. Each tree pattern is a connected sub- 
graph of the original tree and is represented by a nonterminal. In 
our algorithms we found a hierarchy of three different views of the 
grammar: 

(1) node-wise (the slowest),

(2) rule-wise (the fastest), and 

(3) chunk-wise. 

The node-wise view is a proxy to the original tree and allows
executing arbitrary algorithms (using the first-child, next-sibling,
and parent functions). This is the most "detailed" view, but causes a
slow-down (comparable to the space improvement of the grammar, when
compared to succinct trees). The rule-wise view is the most abstract
and high-level view; it means to move through the grammar in one pass,
rule by rule. Specialized algorithms such as executing finite-state
automata can operate in this view. For strings this idea is well
studied [30]. We show in Section 4.2 that the "selecting tree
automata" of [24] can be executed in the rule-wise view in order to
count selected nodes. This is applied to XPath in Section 5 by
compiling queries into selecting automata. The chunk-wise view is
slightly more detailed than the rule-wise view. It means that the
grammar is traversed (once) in a strict dflr order. This allows
keeping track of pre-order numbers and text numbers, by keeping global
counts of element nodes/texts. Through the prMap and SSkip tables we
can apply jumping in this view, which allows building fast XPath
evaluators for serialization and materialization. Processing in the
chunk-wise view is slightly slower than rule-wise (proportional to the
rank of the grammar), because the rule of a nonterminal of rank k is
now processed k + 1 times (instead of only once in rule-wise).

4.1 Node-Wise View 

The node-wise interface allows arbitrary algorithms to be executed
over the original tree (see, e.g., [6]). Inside the interface, a node
is represented by a sequence of pairs which records the productions
that were applied to obtain the node. The length of such a sequence
is at most the depth of the grammar, which can be as large as 10000
(see Figure 4). Thus, even if one pair fits into a single bit (which
can be done), this is large compared to the 32 or 64 bits of a
pre-order node ID. We observe a slow-down of the original algorithm by
approximately the same factor as the compression. Recursive tree
algorithms need a lot of memory due to the size of these sequences.
For iterative algorithms we obtain very good time/space trade-offs,
see Figure 9.

The node-wise interface provides the functions find-root,
first-child, next-sibling, and parent (plus checking the current
label, of course). A node of the original tree is represented as a
sequence of pairs
$\eta = (\mathit{Start}, p_0)(A_1, p_1) \cdots (A_j, p_j)$, where
$p_0$ is a node of the start rhs, $A_1, \dots, A_j$ are nonterminals,
and $p_1, \dots, p_j$ are nodes such that rhs($S$) at node $p_0$ is
labeled $A_1$, and for every $1 \le i < j$, the rhs of $A_i$ at node
$p_i$ is labeled $A_{i+1}$. Moreover, the rhs of $A_j$ at node $p_j$
must be labeled by a terminal symbol, say $b$. The node ID $\eta$ is
then labeled by $b$, denoted $\mathrm{lab}(\eta) = b$. The first-child
(fc), next-sibling, and parent functions are realized as in
Section 6.2 of [6]. For instance, $\mathrm{fc}(\eta)$ is the following
node ID: we first move to the first child of $p_j$ in the rhs of
$A_j$, if it exists. There are three possibilities:
(1) $\mathrm{fc}(p_j)$ is labeled by a terminal symbol. In this case
we are finished and return
$\eta[(A_j, p_j) \leftarrow (A_j, p_j.1)]$, i.e., $\eta$ with the last
pair replaced by $(A_j, p_j.1)$.
(2) $\mathrm{fc}(p_j)$ is labeled by a nonterminal $A_{j+1}$. Let
$\eta' = \eta[(A_j, p_j) \leftarrow (A_j, p_j.1)](A_{j+1}, \varepsilon)$.
If the rhs of $A_{j+1}$ has a terminal at its root node, then the
process is finished and we return $\eta'$. Otherwise, more pairs
$(A_{j+2}, \varepsilon) \cdots (A_{j+k}, \varepsilon)$ are added until
$A_{j+k}$ has a terminal root node (and none of
$A_{j+1}, \dots, A_{j+k-1}$ does).
(3) $\mathrm{fc}(p_j)$ is labeled by a parameter $y_i$. We remove the
last pair from $\eta$ and consider the $i$-th child of the node
$p_{j-1}$ in the rhs of $A_{j-1}$. If it is a terminal, then we are
finished. If it is a nonterminal, then we expand as in case (2). If it
is again a parameter, then the last pair is removed again, and so on
until a non-parameter node is found. This terminates with the desired
node ID of the first-child node.

As an example, the node ID $\eta_0 = (S, \varepsilon)(A, \varepsilon)$
represents the $f$-labeled root node of the tree represented by our
example grammar $\mathcal{G}_1$. To compute $\mathrm{fc}(\eta_0)$ we
move to the first child of $f$ in $A$'s rhs. This is the parameter
$y_1$. Thus, we pop $\eta_0$ and move to the second $A$ of the start
rhs, $(S, 1)$. We expand the $A$ in one step and obtain the result
$(S, 1)(A, \varepsilon)$.
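The three cases of the first-child computation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the grammar below is a hypothetical stand-in modeled on the running example (its exact rules are an assumption), trees are plain `(label, children)` tuples, and the names `RULES`, `normalize`, `find_root`, `label_of`, and `first_child` are invented for this sketch rather than taken from the index implementation.

```python
# Illustrative grammar (an assumption modeled on the running example):
#   S -> A(A(a(b))),  A(y1) -> f(y1, B),  B -> C(c),  C(y1) -> a(y1, c)
# A tree is (label, [children]); parameters are named "y1", "y2", ...
RULES = {
    "S": ("A", [("A", [("a", [("b", [])])])]),
    "A": ("f", [("y1", []), ("B", [])]),
    "B": ("C", [("c", [])]),
    "C": ("a", [("y1", []), ("c", [])]),
}

def is_param(label):
    return label[0] == "y" and label[1:].isdigit()

def node_at(nt, path):
    """Return the (label, children) node of rhs(nt) at a 1-based Dewey path."""
    node = RULES[nt]
    for step in path:
        node = node[1][step - 1]
    return node

def normalize(node_id):
    """Pop parameters and expand nonterminals until the last pair is a terminal."""
    while True:
        nt, path = node_id[-1]
        label = node_at(nt, path)[0]
        if is_param(label):
            # Case (3): continue at the matching child of the calling node.
            caller_nt, caller_path = node_id[-2]
            step = int(label[1:])
            node_id = node_id[:-2] + ((caller_nt, caller_path + (step,)),)
        elif label in RULES:
            # Case (2): descend into the root of the nonterminal's rhs.
            node_id = node_id + ((label, ()),)
        else:
            # Case (1): a terminal has been reached.
            return node_id

def find_root():
    return normalize((("S", ()),))

def label_of(node_id):
    return node_at(*node_id[-1])[0]

def first_child(node_id):
    nt, path = node_id[-1]
    if not node_at(nt, path)[1]:
        return None  # the node has no first child
    return normalize(node_id[:-1] + ((nt, path + (1,)),))

root = find_root()
child = first_child(root)
```

Here `root` corresponds to the node ID $(S, \varepsilon)(A, \varepsilon)$ and `child` to $(S, 1)(A, \varepsilon)$ from the example, with the empty tuple playing the role of $\varepsilon$.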

4.2 Rule-Wise View

The rule-wise view means that the grammar is traversed only
once, rule by rule, and in each step only little computation takes
place which is "compatible" with the grammar. A classical example
of this kind of "computing over compressed structures" is the
execution of a finite-state automaton over a grammar-compressed
string, i.e., over a straight-line context-free grammar (see, e.g.,
Theorem 9 of [30]). The idea is to memoize the "state-behaviour" of
each nonterminal. For tree automata over SLT grammars, the problem
was studied in [20] from a complexity theory point of view.
We use selecting tree automata as in [24] and build a "count
evaluator" which executes in one pass over the grammar. It counts the
number of result nodes of the given XPath query.

The new aspect is to combine this evaluator with the jump table.
Intuitively, if in a certain state only a given label $b$ is relevant
(meaning that only for that label the automaton changes state or
selects the node), then we can jump over nonterminals that do not
produce this label $b$ (determined by the jump table). For instance,
consider the query //f//b which selects all $b$-descendants of
$f$-nodes. It should be intuitively clear that this query can be
answered by considering only the $f$- and $b$-nodes of the document
(and their relationship). This means that during top-down evaluation
we may jump nonterminals which do not produce $f$ or $b$ nodes. We now
introduce, by example, selecting tree automata (ST automata), and
discuss how they can be executed for result-counting over a grammar.
We then show how jumping can be incorporated into this process. Here
is an example of an ST automaton:

    $q_0, f \to q_1, q_0$

    $q_0, L - \{f\} \to q_0, q_0$

    $q_1, b \Rightarrow q_1, q_1$

    $q_1, L - \{b\} \to q_1, q_1$

The first rule says that if in state $q_0$ the automaton encounters an
$f$-labeled node, then it moves to state $q_1$ at the first child, and
to state $q_0$ at the second child. The second rule says that, in
state $q_0$ and for all labels (denoted by $L$) except $f$, it stays
in state $q_0$ at both child nodes. In state $q_1$ the current node is
selected if it is labeled $b$ (denoted by the double arrow in the
third rule). The automaton realizes the XPath query //f//b over our
binary tree representation of XML trees. We now want to execute this
automaton over the example grammar $\mathcal{G}_1$ introduced earlier
in "counting mode", i.e., producing a count of the number of result
nodes. It starts in state $q_0$ processing the start rhs of the
grammar. Its root node is labeled $A$, so the automaton moves to the
$A$-production (still in state $q_0$). The first automaton rule
applies at the $f$-labeled node, meaning to process the first child
($y_1$) in state $q_1$ and the second child $B$ in state $q_0$. The
latter means to process $C$ in state $q_0$, which gives state $q_0$ at
$C$'s parameter $y_1$. We are now finished with processing the
nonterminal $A$ in state $q_0$. In summary: no result node was
encountered, and the state has moved from $q_0$ to state $q_1$ at the
first parameter $y_1$. This "behaviour" of $q_0$ on $A$ is hashed as
$(0, q_1)$. Of course, during this computation, the corresponding
behaviors for $C$ and $B$ are hashed too, i.e., for $q_0$ on $C$ the
value $(0, q_0)$ and for $q_0$ on $B$ the value $(0)$. The automaton
continues in state $q_1$ at the second $A$-node of the start rule.
Unfortunately, no hash for $q_1$ on $A$ exists yet, so the automaton
needs to be run. Again no result node is encountered and it stays in
state $q_1$ at $y_1$. Thus, $(0, q_1)$ is hashed for $q_1$ on $A$.
Finally, it processes the $a$-node of the start production, in state
$q_1$. This gives $q_1$ at the $b$-node. This node is selected
according to the third rule of the automaton and therefore our global
result count is increased, to its final value of one. Observe that if
there was a third $A$-node in the start rhs, such as for the slightly
larger tree $t'$ mentioned before, then hashing is already useful
because there will be a hash-hit for the third $A$. It should be clear
that, in the same way, any ST automaton can be processed in one pass
through the grammar (see also [1, 20]). Note that we only evaluate ST
automata that are deterministic; this means that for every state $q$
and every label $a$ there is at most one transition with left-hand
side "$q, a$".
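The counting run with hashed behaviours can be illustrated by a small memoizing evaluator. The following Python sketch rests on assumptions: the grammar is a hypothetical stand-in modeled on the running example, the automaton is the //f//b automaton shown above, and the names `RULES`, `delta`, `behaviour`, `run`, and `memo` are invented for this illustration. The actual evaluator works on the index data structures rather than on tuples.

```python
# Illustrative grammar (an assumption modeled on the running example):
#   S -> A(A(a(b))),  A(y1) -> f(y1, B),  B -> C(c),  C(y1) -> a(y1, c)
RULES = {
    "S": ("A", [("A", [("a", [("b", [])])])]),
    "A": ("f", [("y1", []), ("B", [])]),
    "B": ("C", [("c", [])]),
    "C": ("a", [("y1", []), ("c", [])]),
}

def delta(state, label):
    """//f//b automaton: (selecting?, (state at 1st child, state at 2nd child))."""
    if state == "q0":
        return (False, ("q1", "q0")) if label == "f" else (False, ("q0", "q0"))
    return (label == "b", ("q1", "q1"))  # state q1

memo = {}  # (nonterminal, state) -> (count, tuple of parameter states)

def behaviour(nt, state):
    """Hashed behaviour of `state` on `nt`; computed at most once per pair."""
    key = (nt, state)
    if key not in memo:
        params = {}
        count = run(RULES[nt], state, params)
        memo[key] = (count, tuple(params[i] for i in sorted(params)))
    return memo[key]

def run(tree, state, params):
    """Count selected nodes in `tree`; record the states reaching parameters."""
    label, children = tree
    if label[0] == "y" and label[1:].isdigit():  # parameter y_i
        params[int(label[1:])] = state
        return 0
    if label in RULES:                            # nonterminal: use the hash
        count, child_states = behaviour(label, state)
    else:                                         # terminal: apply a transition
        selecting, child_states = delta(state, label)
        count = 1 if selecting else 0
    for child, child_state in zip(children, child_states):
        count += run(child, child_state, params)
    return count

total = run(RULES["S"], "q0", {})
```

After the run, `memo[("A", "q0")]` holds the pair written as $(0, q_1)$ in the text, and `total` is the global result count.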

Adding Jumping 

Consider the example automaton from before. It should be clear that
in state $q_0$ the automaton only cares about $f$-labeled nodes, i.e.,
it can omit all nodes with other labels and safely proceed to the
first $f$-labeled descendant node (if such a node exists). In the
terminology of [24], the omittable nodes are "not relevant". Here we
say that a node is relevant if the automaton either selects the node,
or changes state, i.e., applies a transition with rhs $(q', q'')$,
where $q' \neq q$ or $q'' \neq q$. Note that in [24] relevance is
defined based on minimal automata; we have dropped this restriction
and define it for arbitrary (but deterministic) ST automata. We
further say that for state $q$, a label $u$ is a relevant label if the
automaton's transition for $q$ and $u$ is selecting, or changes state,
i.e., has rhs $(q', q'')$ with $q' \neq q$ or $q'' \neq q$. Obviously,
during the run of the automaton, the relevant labels allow us to
determine the next relevant node.

We use the jump table in order to omit ("jump") nonterminals
which do not contain relevant nodes for the current state $q$: if the
jump table indicates that a nonterminal does not produce nodes labeled
$U = \{u_1, \dots, u_k\}$, and the relevant labels of the current
state are in $U$, then the nonterminal may be jumped. By our
definition of relevance this implies that all parameters of the jumped
nonterminal will be processed in state $q$. Back to the example: since
$f$ is a relevant label for $q_0$, we cannot jump the first $A$-node
of the start rhs. Hence, the automaton proceeds as before and
eventually the entry $(0, q_1)$ is hashed for $q_0$ and $A$. The
automaton proceeds in state $q_1$ at the second $A$-node of the start
rhs. The only relevant label for $q_1$ is $b$. The jump table tells us
that $A$ does not generate $b$'s. Thus, we jump this $A$ and continue
evaluating at its child node. This saves a lot of computation (roughly
half of before). But in which state is the automaton supposed to
continue? It must be state $q_1$ because, by definition of relevance,
the state never changes on non-relevant nodes. Thus, parameter $y_1$
must be reached in state $q_1$. We proceed to the $b$-node of the
start rhs and compute the correct final count of 1.
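The jump test itself is a set comparison between the relevant labels of the current state and the labels a nonterminal can generate. The sketch below is hedged accordingly: it uses a hypothetical grammar modeled on the running example, a fixed four-letter alphabet, and a naively computed label set per nonterminal as a stand-in for the jump table; `relevant_labels`, `produced_labels`, and `can_jump` are names made up for this illustration.

```python
# Illustrative grammar (an assumption modeled on the running example):
#   S -> A(A(a(b))),  A(y1) -> f(y1, B),  B -> C(c),  C(y1) -> a(y1, c)
RULES = {
    "S": ("A", [("A", [("a", [("b", [])])])]),
    "A": ("f", [("y1", []), ("B", [])]),
    "B": ("C", [("c", [])]),
    "C": ("a", [("y1", []), ("c", [])]),
}
ALPHABET = {"a", "b", "c", "f"}  # assumed label alphabet of the document

def delta(state, label):
    """//f//b automaton: (selecting?, (state at 1st child, state at 2nd child))."""
    if state == "q0":
        return (False, ("q1", "q0")) if label == "f" else (False, ("q0", "q0"))
    return (label == "b", ("q1", "q1"))

def relevant_labels(state):
    """Labels whose transition selects the node or changes the state."""
    result = set()
    for label in ALPHABET:
        selecting, rhs = delta(state, label)
        if selecting or rhs != (state, state):
            result.add(label)
    return result

def produced_labels(nt, cache):
    """Terminal labels generated by `nt` (naive stand-in for the jump table)."""
    if nt not in cache:
        labels, stack = set(), [RULES[nt]]
        while stack:
            label, children = stack.pop()
            stack.extend(children)
            if label in RULES:
                labels |= produced_labels(label, cache)
            elif not (label[0] == "y" and label[1:].isdigit()):
                labels.add(label)
        cache[nt] = labels
    return cache[nt]

def can_jump(nt, state, cache):
    """A nonterminal may be jumped if it produces no relevant label."""
    return relevant_labels(state).isdisjoint(produced_labels(nt, cache))

table = {}
jump_in_q0 = can_jump("A", "q0", table)
jump_in_q1 = can_jump("A", "q1", table)
```

With these assumptions `jump_in_q0` is false (the pattern of $A$ contains an $f$) and `jump_in_q1` is true (it contains no $b$), matching the walk-through above.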

As another example, imagine the start rhs was $A(A(b))$ and we
execute a query that selects all $b$-children of the root node, in
XPath: /b (let us ignore that in XML the root node has only one
child). An automaton for this query is:

    $q_0, b \Rightarrow q_1, q_0$
    $q_0, L - \{b\} \to q_1, q_0$
    $q_1, L \to q_1, q_1$

In state $q_0$, all labels are relevant, because there is a state
change in all transitions for $q_0$. We therefore process as before,
eventually hash the entry $(0, q_1)$ for $q_0$ and $A$, and determine
that $y_1$ of $A$ needs to be processed in state $q_1$. For this
state, no label is relevant. Thus, the second $A$ may be jumped. We
arrive at the $b$-node of the (new) start production from above, and
terminate (with count zero).

XPath Specific Finer Relevance 

For XPath, we found it beneficial to use a slightly finer definition
of relevance. It allows us to jump more nonterminals for automata that
realize XPath. First, define that a state $q$ is universal if it has
the transition $q, L \to (q, q)$, i.e., it never changes for any
label. For all our automata there is at most one (fixed) universal
state, which is denoted by $q_U$. For instance, in the above automaton
for /b, $q_U = q_1$. We define: a node is not f-relevant if the
automaton does not select the node, and applies a transition with rhs
$(q, q)$, $(q_U, q)$, or $(q_U, q_U)$. For state $q$ the label $u$ is
not f-relevant if the $(q, u)$-transition is not selecting, and its
rhs is of the form $(q, q)$, $(q_U, q)$, or $(q_U, q_U)$. Let us
consider the last example from above again, the automaton for /b over
our example grammar. This time, only $b$ is a relevant label for
$q_0$, because $q_U = q_1$. The rule for jumping non-f-relevant nodes,
in a given state $q$, is: the $q$-transitions for all non-f-relevant
labels must all have the same rhs, which itself is one of $(q, q)$,
$(q_U, q)$, or $(q_U, q_U)$. Thus, we may jump the first $A$-node of
the start rhs. Now it is more difficult to determine in which state to
proceed at $y_1$ of $A$: the root node of $A$'s pattern tree, and its
descendants of the form $2.2.\cdots.2$ (in Dewey notation), are
processed in state $q_0$, while all other nodes are processed in state
$q_1$. Since $A$'s pattern tree is $f(y_1, a(c, c))$, this means that
$q_1$ is the correct state for $y_1$. However, if $A$'s pattern tree
was different, for instance $f(a(c, c), y_1)$, then we would need to
assign the state $q_0$ to $y_1$. This shows that in order to correctly
jump a nonterminal $X$ which contains no f-relevant nodes, we need to
statically know whether or not $X$'s last parameter $y_j$ occurs at a
$2.2.\cdots.2$-node in $X$'s pattern tree $t_X$. This information is
determined at indexing time and is stored with the grammar as part of
our index. Since its size is negligible (one bit per nonterminal), we
do not explicitly mention it in our size calculations.
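The finer relevance test and the role of the stored bit can be made concrete for the /b automaton. The following Python sketch is an illustration under assumptions: `Q_U`, `is_f_relevant`, `on_right_spine`, and `state_at_last_parameter` are hypothetical names, the alphabet is assumed to be {a, b, c, f}, and the last function covers only the case discussed in the text, where the common rhs of the non-f-relevant transitions has the shape $(q_U, q)$.

```python
Q_U = "q1"                       # universal state of the /b automaton
LABELS = ["a", "b", "c", "f"]    # assumed label alphabet

def delta(state, label):
    """/b automaton: (selecting?, (state at 1st child, state at 2nd child))."""
    if state == "q0":
        return (label == "b", ("q1", "q0"))
    return (False, ("q1", "q1"))

def is_f_relevant(state, label):
    """Finer relevance: selecting, or rhs not of a 'harmless' shape."""
    selecting, rhs = delta(state, label)
    harmless = {(state, state), (Q_U, state), (Q_U, Q_U)}
    return selecting or rhs not in harmless

def on_right_spine(path):
    """True for the root () and for Dewey paths of the form 2.2...2."""
    return all(step == 2 for step in path)

def state_at_last_parameter(state, param_path):
    """State reaching the parameter after jumping a nonterminal without
    f-relevant nodes, assuming the common rhs has the shape (Q_U, state)."""
    return state if on_right_spine(param_path) else Q_U

f_relevant_in_q0 = [u for u in LABELS if is_f_relevant("q0", u)]
state_first = state_at_last_parameter("q0", (1,))   # pattern f(y1, a(c, c))
state_second = state_at_last_parameter("q0", (2,))  # pattern f(a(c, c), y1)
```

The Boolean returned by `on_right_spine` for the parameter's position corresponds to the one bit per nonterminal mentioned above; computing it once at indexing time avoids inspecting the pattern tree when jumping.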



Adding Skipping 

When an automaton is in its universal state $q_U$, we may skip the
entire subtree because it contains no relevant nodes. For the count
evaluator this is done by omitting all recursive calls for state
$q_U$. This holds for terminal nodes, as well as for the hashed
behavior of nonterminal nodes. For the materialize and serialize
evaluators, it is necessary to know the number of element nodes and
text nodes of the skipped subtree to correctly continue evaluating.
During recursion these numbers are determined by the prMap/textMap
tables. If we are in the start rhs, then we use the SSkip/textSSkip
tables.

4.3 Chunk-Wise View

We now wish to serialize XML result subtrees of the nodes selected
by an automaton. In addition to the grammar, we need access to the
text values of the XML document. We assume a function getText(i)
which returns the i-th text or attribute value of the document
(starting from zero). For instance, getText(6) returns the 7-th text
value, i.e., the string "serialization", for this example document:

<g>This<f><f><a><b>is</b></a><c>a test</c></f>
<a><c>document</c><c>for the purpose</c></a></f>
<a><c>of explaining</c><c>serialization</c></a></g>

A faithful grammar representation of the XML structure tree of
this document is:

    $S \to g(\_T, A(A(a(b(\_T), c(\_T)))))$

    $A(y_1) \to f(y_1, B)$

    $B \to C(c(\_T))$

    $C(y_1) \to a(y_1, c(\_T))$

For simplicity we do not transform this grammar into bCNF. We
would like to serialize (using the jump table) the nodes selected by
this automaton:

    $q_0, c \Rightarrow q_0, q_0$
    $q_0, L - \{c\} \to q_0, q_0$

The automaton begins in state $q_0$ at the root of the start rhs. The
recursive algorithm over grammar rules is shown in Figure 7; exactly
the same algorithm is used over the start rhs (but iteratively, using
stacks). During a dflr traversal the global counter num_T stores the
number of _T nodes seen so far. Thus, at $g$'s first child num_T is
set to 1. We now process, still in state $q_0$, $A$'s first chunk
(that is, the sequence of tags from $t_A$'s root node to its first
parameter node). This is done by first calling the rule-wise evaluator
of Section 4.2 in order to compute and hash the parameter states for
$A$ and the information whether a parameter is inside a result subtree
(see the $u_i$'s in the algorithm of Figure 7). This will add the
hashes $(1, q_0)$ for $q_0$ on $C$, $(2)$ for $B$, and $(2, q_0)$ for
$A$. The first chunk only contains <f> and therefore the empty tag
sequence $(0, 0)$ is hashed for $A$'s first chunk in $q_0$, i.e., for
the triple $(q_0, A, 1)$. Further, $(q_0, A, 2)$ is pushed onto our
"pending computation" (PC) stack. The next step in dflr is the first
chunk of the second $A$-node. Both rule-wise and chunk-wise behaviors
are hashed already, so nothing needs to be computed and again
$(q_0, A, 2)$ is pushed onto the PC stack.

The dflr traversal continues at the subtree $a(b(\_T), c(\_T))$.
The $a$ and $b$ nodes do not cause state changes or node selection. At
the first _T-node, num_T is set to 2. At the $c$-node a selecting
transition fires. Thus, we now start appending tags to the
"intermediate result tag" (IRT) sequence, first the tag <c>. We also
append the pair of start position and current num_T value to the
"final result list". Moreover, </c> is pushed onto the PC stack.
Evaluation continues at the _T-child (thus num_T is increased to 3).
We append <_T/> to the IRT sequence. Since _T is a leaf, the PC stack
is popped and therefore </c> is added to the IRT sequence. This
finishes the result subtree. At the next step we return to the
$a$-node of the start rhs.

We return to the second $A$-node and pop the PC stack. This gives
$(q_0, A, 2)$. No (begin, end)-pair is hashed for this triple, so
$A$'s second chunk is processed in state $q_0$. Recursion continues to
$B$ and $C$ and finally moves to the parameter tree $c(\_T)$ of $C$ in
$B$'s rhs. This causes us to append <c><_T/></c> to the IRT sequence,
to append $(4, 3)$ to the final result list, and to increase num_T (to
4). We proceed at the second chunk of $C$, ignore </a>, append
<c><_T/></c> and $(7, 4)$ to the IRT sequence and final result list,
respectively, and increment num_T (to 5). The pair $(7, 9)$ is now
hashed for the triple $(q_0, C, 2)$. The grammar recursion continues
at $B$ and $A$, so $(4, 9)$ is hashed for $(q_0, B, 1)$ and $(4, 9)$
for $(q_0, A, 2)$.

The dflr run continues at the first $A$ of the start rhs and pops the
PC stack. This gives $(q_0, A, 2)$. We now have our first hash-hit and
retrieve the (begin, end)-pair $(4, 9)$. This is interpreted as a
"copy instruction": append to the IRT sequence (currently at position
10) its own content from position 4 to 9. During this copying we
observe that 4 and 7 are final result begin-positions, and that their
corresponding num_T-values are 5 and 6, respectively (by incrementing
num_T during copying). The content of the final result list is
$(1, 2)(4, 3)(7, 4)(10, 5)(13, 6)$. The IRT sequence contains five
copies of <c><_T/></c>. In a final step we print correct XML document
fragments for each result. This is done by copying from the IRT
sequence while inserting for each <_T/> the correct text value.
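The copy instruction and the final printing step can be replayed on the example. The Python sketch below is an illustration only: it starts from the state reached just before the hash-hit (nine IRT entries, three results, num_T equal to 5), treats the IRT sequence as a list of tag strings with 1-based positions as in the text, and uses the invented helper names `copy_chunk`, `serialize`, and `get_text`.

```python
# Text values of the example document, in document order (index from zero).
TEXTS = ["This", "is", "a test", "document", "for the purpose",
         "of explaining", "serialization"]

def get_text(i):
    return TEXTS[i]

def copy_chunk(irt, frl, num_t, begin, end):
    """Append IRT[begin..end] (1-based, inclusive) to the IRT sequence itself.
    Results starting inside the copied range are re-registered in the final
    result list with the current num_T value."""
    result_starts = {pos for pos, _ in frl}
    for pos in range(begin, end + 1):
        tag = irt[pos - 1]
        if pos in result_starts:
            frl.append((len(irt) + 1, num_t))
        irt.append(tag)
        if tag == "<_T/>":
            num_t += 1
    return num_t

def serialize(irt, frl):
    """Print each result subtree, replacing <_T/> by the matching text value."""
    fragments = []
    for begin, num_t in frl:
        parts, depth, i = [], 0, begin - 1
        while True:
            tag = irt[i]
            if tag == "<_T/>":
                parts.append(get_text(num_t))
                num_t += 1
            else:
                parts.append(tag)
                depth += -1 if tag.startswith("</") else 1
            i += 1
            if depth == 0:   # the result subtree is complete
                break
        fragments.append("".join(parts))
    return fragments

# State just before the hash-hit for (q0, A, 2).
irt = ["<c>", "<_T/>", "</c>"] * 3
frl = [(1, 2), (4, 3), (7, 4)]
num_t = copy_chunk(irt, frl, 5, 4, 9)   # the copy instruction (4, 9)
fragments = serialize(irt, frl)
```

Because the copy only appends tags and list entries, no rule of the grammar is visited again for the repeated chunk; the text values are substituted only once, during the final printing pass.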



To see how jumping works, consider the query //b over this grammar.
Now $A$'s first chunk can be jumped. During the rule-wise traversal,
jumping takes place as discussed in Section 4.2. Next, the first chunk
of the second $A$-node in the start rhs is jumped. The first hit for
//b is obtained at the $b$-node of the start rhs. The dflr traversal
jumps the second chunks of both $A$-nodes of the start rhs, and is
finished. The final result list is $(1, 1)$ and the IRT sequence
is <b><_T/></b>. Thus, we print <b>is</b>.

Comments to Figure 7: In Line 2 we calculate rule-wise the parameter
states $s_1, \dots, s_n$ and the Booleans $u_1, \dots, u_n$ which
determine if a parameter is inside of a result subtree. Line 3: if
$p = 0$ then $y_p$ refers to the root node and if
$p = \mathrm{rank}(N)$ then $y_{p+1}$ also refers to the root node.
The $X_i$ and $p_i$ are determined by the shape of rhs($N$).

5. XPATH EVALUATION 

We built rudimentary XPath evaluators that compile a given XPath
query into an ST automaton. Our current evaluator only works for
the /, //, and following-sibling axes and does not support filters.
The count evaluator is based on the rule-wise view of Section 4.2,
while the materialize and serialize evaluators are based on the
chunk-wise view of Section 4.3. For the small XPath fragment
considered here, the translation into ST automata is straightforward
and similar to the one of [1, 24] (essentially, the automaton is
isomorphic to the query). First, an automaton is built which uses
nondeterminism for the //-axis. For instance, we first obtain an
automaton similar to the one shown at the beginning of Section 4.2,
but with $L - \{f\}$ replaced by $L$, and $L - \{b\}$ replaced by $L$.
Different from [1, 24], which work on-the-fly, we fully determinize
the automaton before evaluation. For the example, this gives precisely
the automaton as shown. We can prove that determinization of our ST
automata does not cause an exponential blow-up. This is due to the
simple form of our queries. Moreover, the determinization procedure
always produces minimal automata. In fact, it can be shown that
for a given XPath query with $m$ axes (over /, //, and
following-sibling) the resulting deterministic ST automaton has at
most $2m$ states. Note that in terms of the transitions along a
first-child path, our deterministic ST automata behave exactly in the
same way as "KMP-automata" (see, e.g., Chapter 32 of [8]), i.e.,
matching along a path works very much in the same way as the
well-known KMP algorithm.

What is the time complexity for counting, i.e., of executing a
deterministic ST automaton over an SLT grammar? As mentioned already
in [20], even for general context-free tree grammars, a deterministic
top-down tree automaton can be executed in polynomial time. We make
this more precise: for each nonterminal (of rank $k$) of the grammar
and state of the automaton, we need to compute at most one $k$-tuple
of parameter states. Hence, the following holds.

Lemma 5.1. Let $G$ be an SLT grammar in which every production
is in bCNF and let $M$ be an ST automaton. Let $n$ be the
number of nonterminals of $G$, $k$ the rank of $G$, and $m$ the
number of states of $M$. The automaton $M$ can be executed rule-wise
(e.g., for counting) over the grammar $G$ in time $O(mnk)$.

Note that an alternative way is to first reduce the number of
parameters of the grammar to one, using the result of [22]. For a
binary ranked alphabet (as we are using here for XML), the size of
the resulting grammar is $O(2|G|)$, where $|G|$ denotes the size of
$G$. We then apply the above lemma in time $O(mn')$, where $n'$ is the
number of nonterminals of the new grammar. It remains to be seen
in practice which of the two approaches gives better running times.
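The determinization step described in this section can be sketched as an ordinary subset construction. This Python sketch is an assumption-laden illustration, not the evaluator's actual procedure: transitions are tuples `(state, label, selecting, (first, second))`, the pseudo-label `"*"` stands for $L$ in a nondeterministic rule and, on the deterministic side, for all labels not named in the query; `determinize` and `NFA` are invented names.

```python
def determinize(nfa, start, labels):
    """Subset construction for a top-down ST automaton.
    A node is selected if some fired transition is selecting."""
    other = "*"  # represents every label not listed in `labels`
    start_set = frozenset([start])
    det, todo, seen = {}, [start_set], {start_set}
    while todo:
        current = todo.pop()
        for u in list(labels) + [other]:
            # Rules for label u fire, and so do wildcard rules.
            fired = [t for t in nfa if t[0] in current and t[1] in (u, "*")]
            selecting = any(t[2] for t in fired)
            first = frozenset(t[3][0] for t in fired)
            second = frozenset(t[3][1] for t in fired)
            det[(current, u)] = (selecting, first, second)
            for nxt in (first, second):
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
    return det, seen

# Nondeterministic automaton for //f//b: the rules for L - {f} and L - {b}
# are replaced by rules for L (written "*").
NFA = [
    ("q0", "f", False, ("q1", "q0")),
    ("q0", "*", False, ("q0", "q0")),
    ("q1", "b", True,  ("q1", "q1")),
    ("q1", "*", False, ("q1", "q1")),
]

det, states = determinize(NFA, "q0", ["f", "b"])
```

For this query the construction yields the two state sets {q0} and {q0, q1}, which play the roles of $q_0$ and $q_1$ in the deterministic automaton of Section 4.2.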

6. EXPERIMENTS 

All experiments are done on a machine featuring an Intel Core2 
Xeon processor at 3.6GHz, 3.8GB of RAM, and an S-ATA hard 
drive. The OS is a 64-bit version of Ubuntu Linux. The kernel 
version is 2.6.32 and the file system is ext3 with default settings. 
All tests are run with only the essential services of the OS running. 
The standard compiler and libraries available on this distribution 
are used (namely g++ 4.4.1 and libxml2 2.7.5 for document parsing).
Each query is run three times and the fastest of the three running
times is selected. We only count query execution time, i.e.,



function recPrint(nt N, state s, chunkNo p, bool u) {
  let S = (X_1, s_1, p_1, u_1)(X_2, s_2, p_2, u_2) ... (X_n, s_n, p_n, u_n)
    be the sequence of T/NT-chunks in rhs(N) between "y_p and y_{p+1}";
  int currLength = IRT_length;
  for i = 1 to n do
    if (X_i = nonterminal) then
      if (hash(X_i, s_i, p_i, u_i) = (z_1, z_2)) then
        for j = 1 to z_2 do
          append(IRT, IRT[z_1 + j]);
          if (IRT[z_1 + j] is a result) then
            append(FRL, (IRT_length, num_T));
      else recPrint(X_i, s_i, p_i, u_i);
    else
      if (p_i = 0) then
        if ((s_i, tag(X_i)) is selecting or u_i = 1) then
          append(IRT, "<tag(X_i)>");
        if ((s_i, tag(X_i)) is selecting) then
          append(FRL, (IRT_length, num_T));
        if (tag(X_i) = _T) then num_T++;
      if (p_i = 1 and ((s_i, tag(X_i)) is selecting or u_i = 1)) then
        append(IRT, "</tag(X_i)>");
  hash(N, s, p, u) = (currLength, IRT_length - currLength); }

Figure 7: Grammar-recursive case of the print function



do not take into account query translation times etc. For experiments
that involve serialization the programs are directed to write
to /dev/null.

MonetDB. We use Server version 4.38.5, release Jun2010-SP2.
This contains the MonetDB/XQuery module v0.38.5. We compare
pure query execution time, so we report the "Query" time.

Qizx. Version 4.0 (June 10th, 2010) of the free engine is used.
The "-V -r 2"-switches are used. For count queries we use the
"evaluation time:"-number reported by Qizx. For serialization the sum
of the "evaluation time:" and "display time:"-numbers is used.
For a few count queries Qizx executed faster over the XML structure
tree than over the original XML document. This is indicated
by a footnote in Figure 12.

SXSI. The version used for [1] was supplied to us by the authors.

6.1 Traversal Access 

To investigate the speed of our node-wise view, we consider fixed
traversals: depth-first left-to-right (dflr) and depth-first
right-to-left (dfrl) traversals, both recursively and iteratively.
Dflr traversals are a common access pattern for XPath evaluation.
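An iterative dflr traversal needs only the first-child, next-sibling, and parent functions of the interface. The following Python sketch shows the access pattern on a plain pointer tree; the `Node` class and the helper names are assumptions made for this illustration, and the counter of parent calls is included only because the number of such calls is discussed for the iterative case later in this section.

```python
class Node:
    """Plain pointer-based tree node (a stand-in for a node-wise interface)."""
    def __init__(self, label, children=()):
        self.label = label
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

def first_child(node):
    return node.children[0] if node.children else None

def next_sibling(node):
    if node.parent is None:
        return None
    siblings = node.parent.children
    i = siblings.index(node)  # identity-based lookup among the siblings
    return siblings[i + 1] if i + 1 < len(siblings) else None

def parent(node):
    return node.parent

def dflr_iterative(root):
    """Visit all nodes depth-first left-to-right without recursion."""
    order, parent_calls, node = [], 0, root
    while node is not None:
        order.append(node.label)
        nxt = first_child(node)
        # No child: look for a sibling, climbing up as long as there is none.
        while nxt is None and node is not None:
            nxt = next_sibling(node)
            if nxt is None:
                node = parent(node)
                parent_calls += 1
        node = nxt
    return order, parent_calls

tree = Node("g", [Node("f", [Node("a"), Node("c")]), Node("a")])
order, parent_calls = dflr_iterative(tree)
```

A recursive traversal keeps the ancestors on the call stack and so does not need the parent function at all, which is one reason why the two variants can behave differently over the same grammar.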

The speed of our interface is lower-bounded by the speed of the
start rhs representation. Since it takes time to get a single pair out
of our node ID sequence data structure, a plain traversal through the
(bp)-start rhs is slower than a traversal through "Succinct" (= the
whole XML structure tree represented in (bp)). To see this, we
built grammars that have no nonterminals (except Start) but store
the complete tree in their start rhs. The full traversal speed of
these grammars is shown as OneRule in Figures 8 and 9. It is also
possible to transform the Start rhs into bCNF. Intuitively, this will
introduce as many new nonterminals as there are nodes in the Start
rhs. If we apply this to the OneRule grammars from before, then we
obtain NoStartRule grammars in which each node is explicitly
represented by a nonterminal. The traversal speed over such grammars
should be comparable to that of pointers, because this is similar to
a pointer-based representation. Again, this is not exactly the case,
because of the additional overhead implied by our node ID data
structure. Finally, compressed grammars: we test (binary tree)
DAGs, BPLEX, and TreeRePair grammars. The resulting traversal
speeds for iterative full traversals are shown in Figure 9. For
recursive traversals the graph looks similar: all run times are about
twice as fast as in the iterative graph, except for "Pointer" which
stays the same. The only big difference is that for recursive
traversals, the DAG line is in between the TreeRePair and the
NoStartRule lines. Note that for recursive traversals we added a data
structure called "node pool" which realizes prefix-sharing of node
IDs. Without such a data structure, recursive traversals are roughly
ten times slower (due to dynamic allocation of node IDs). Through
profiling we found the reason why DAG traversals are much slower in
the iterative case: for DAG grammars we have about eight times more
calls to the parent function (in the start rhs). The number of these
calls is approximately equal to the number of nodes in the start rhs.
As shown in Figure 17, the size of the start rhs is about eight times
larger than those of BPLEX and RePair. Note that for NoStartRule
grammars the number of nonterminals equals two times the number of
non-_T-nodes of the document, plus the number of _T-nodes of the
document. This is because we use one fixed nonterminal to represent
_T-nodes, i.e., we hash-cons all _T-subtrees. The OneRule grammars
have $(2n - 1)$-many nodes in the start rhs, because for every binary
node there is an additional null-tree.

Figure 8: Recursive tree traversals over XMark

Figure 9: Iterative tree traversals over XMark

Figure 10: Space requirement for iterative traversals

To summarize the time/space trade-off: for recursive traversals,
compared to succinct trees our interface (using TreeRePair grammars)
is 5-6 times slower and uses 3 times less space, while it is 12
times slower and uses 18 times less space when compared to pointers.
For iterative traversals we are 7 times and 15-16 times slower
compared to succinct and pointers, respectively, and use 9-15 and
167-309 times less space, respectively.

6.2 Counting 

Figure 12 shows timings for XPath counting over 116MB, 1GB,
and 11GB XMark files. The queries Q01-Q08 and Q13-Q16 are
shown in Figure 11, while queries X1-X3 (taken from [1]) are
shown in Figure 16. For TinyT we load our base index plus the
jump table. For SXSI and MonetDB it was beneficial to load the
entire original document: this gave faster counting times than
processing over an XML document representing the XML structure
tree. For Qizx the same holds, but for queries Q05 and X1 the
XML structure tree gave faster times, as indicated by the footnote
in Figure 12. The figure shows counting times for our benchmark
queries of Figure 11 over our three different XMark files. We were
not able to load the 11GB XMark document into SXSI or Qizx. We
did succeed in loading it into MonetDB, but times get rather slow from
Q03 onwards, due to disk access. As can be seen, TinyT is faster
than all other systems. Moreover, count times for TinyT scale with
respect to the query: for XMark116M, all queries run in <7ms.
Similarly for the other documents. This is in stark contrast to all
other systems.

Figure 11: Benchmark queries over XMark

The run-time memory of our count evaluator essentially consists
of the index, plus the hash table for parameter states, plus number
counters for each nonterminal. This adds about 12 Bytes per
(state,NT)-pair. Typically, for an index of 1MB, we have an
additional 2-3MB of run-time memory.

Label Queries 

A label query is of the form //lab and counts the number of
lab-labeled nodes in the document. Several specialized indexes can
be used for fast label-query execution. Obviously, such queries
are not very interesting (and could be solved through a small extra
table). But, for some systems, such as SXSI, those queries can
easily be restricted to a subtree range. This gives more flexibility;
for instance, count queries such as //a//b could be optimized by
moving through the top-most a-nodes, and summing the subtree
counts of //b for each such node.

When we write "SXSI" in Figures 13 and 14 we mean the tree
structure index of SXSI. The latter uses several copies (one per
label) of the balanced parenthesis structure [31], and compresses
those using sarrays [26] (uncompressed copies would be even much
larger: 1.8M for one copy of the XMark116M document, times 88
labels gives 158MB, while with sarrays SXSI only needs 25M).
Intuitively, the private copy of the parenthesis structure for a given
label lab indicates only the lab-labeled nodes of the document. Thus,
to execute the query //category, SXSI accesses the category-copy
of the parenthesis structure and asks for the number of ones in this
structure (realized by the "rank" operation which is efficiently
implemented for sarrays). The sizes of the different indexes are shown
in Figure 14.

As the timings in Figure 13 show, SXSI is the fastest for such
queries, and delivers constant time. In the figure "Fer+" refers to
an implementation of [9, 10] which was kindly supplied to us by
Francisco Claude (see the next section for more details). In this
implementation the speed depends on the selectivity of the query.

Figure 13: Label queries, counting (in ms)

Figure 14: Index sizes (in MB)



Simple paths 

An XPath query of the form $//a_1/a_2/\cdots/a_n$, where
$a_1, \dots, a_n$ are label names, is called a simple path. Note that
each $a_i$ must be an element name, i.e., the wildcard-star (*) is not
allowed. Such queries can be handled by the specialized index of
Ferragina et al. [9, 10]. In fact, that index can even materialize
result nodes, but not by pre-order numbers. We therefore did not
include it in Section 6.4. We use our own implementation of [9, 10],
called "Fer+". It is optimized for speed, not for size. Their own Java
implementation
(http://www.di.unipi.it/~ferragin/Libraries/xbwt-demo.zip) produces
much smaller indexes, see column "Fer-j" in Figure 14, but also
performs much slower. For instance, it uses 476ms for query q1 and
takes 106ms for the query q2.



Our "Fer+" implementation is fast for queries with low selectivity
and slow for those with large selectivity. This can be seen in
Figure 15: for q1, which has the lowest selectivity, Fer+ is 15-times
faster than TinyT, while for q4 TinyT is slightly faster than Fer+.
For larger XMark sizes the relative performance of TinyT is better,
due to compression: for XMark1G, TinyT is already faster for q3,
and is faster by a factor of > 4.5 for query q4.

6.3 Serialization 

Figure 12 shows timings for serialization over 116MB and 1GB
XMark files. TinyT gave the fastest times for all our serialization
experiments. For printing a single subtree (e.g., Q01 and Q02)
TinyT is about 1.5-times faster than the next fastest system (SXSI).
For larger result sets the time difference is bigger. The largest time
difference is for Q07 (over XMark1G): TinyT serializes 5.8-times
faster than SXSI. For the same query over XMark116M the difference
is only a factor of 2.1. This suggests that the speed-up is related to
the compression in our index. This is interesting, because one would
expect that the pure XML serialization time will dominate query
evaluation and book keeping times.

Figure 15: Simple path queries, counting (in ms)

Figure 12: Times (in ms) for XMark benchmark queries
† Time is over the stripped XML structure tree document of Figure 3 (faster than over original document)

For TinyT, we load all indexes shown in Figure 5 together with
a "text collection". The latter gives access to getText(i), the i-th
value (text or attribute) of the document. Our text collection stores
all text consecutively in one large memory buffer. This takes space
equal to the size of the file minus the "Non-Text" value in Figure 2,
e.g., 82MB for XMark116M. We use a simple data structure to map from
text numbers to begin positions in the buffer. During serialization we
opted for speed, not space. Recall from Section 4.3 our serialization
process: we first build tag sequences of all document subtrees
to be output, together with copy instructions that point into those
sequences. These tag sequences still contain _T tags. After
evaluation, XML serialization starts by (1) writing full XML subtrees,
correctly replacing _T nodes by their text values, and (2)
replacing copy instructions by their correct serialization.
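The text collection and the two serialization steps can be sketched as
follows. This is a simplified illustration under several assumptions
that are not taken from the implementation: the class TextCollection,
the function serialize, the token encoding (strings for tags, "_T" for a
text placeholder, a pair (lo, hi) for a copy instruction that points to
an earlier token range), and the rule that text values are consumed in
document order are all hypothetical.

```python
class TextCollection:
    """All text and attribute values stored back to back in one buffer."""

    def __init__(self, values):
        # starts[i] is the begin position of value i in the buffer;
        # the final entry marks the end of the last value.
        self.starts = [0]
        chunks = []
        for value in values:
            data = value.encode("utf-8")
            chunks.append(data)
            self.starts.append(self.starts[-1] + len(data))
        self.buffer = b"".join(chunks)

    def get_text(self, i: int) -> str:
        """Return the i-th value of the document (0-based in this sketch)."""
        return self.buffer[self.starts[i]:self.starts[i + 1]].decode("utf-8")


def serialize(tokens, texts, first_text=0):
    """Expand a tag sequence with _T placeholders and copy instructions.

    Assumption: a copy instruction (lo, hi) refers to tokens[lo:hi] that
    lie strictly before the instruction, so expansion terminates.
    """
    out = []
    next_text = first_text

    def emit(lo, hi):
        nonlocal next_text
        for token in tokens[lo:hi]:
            if isinstance(token, tuple):   # copy instruction
                emit(*token)
            elif token == "_T":            # text placeholder
                out.append(texts.get_text(next_text))
                next_text += 1
            else:                          # ordinary tag
                out.append(token)

    emit(0, len(tokens))
    return "".join(out)
```

For example, with the values "pen" and "ink" and the tokens
["<item>", "<name>", "_T", "</name>", "</item>", (0, 5)], the tag
sequence of the first item is written once and reused for the second
item, while each _T receives its own text value.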

6.4 Materialization 

Initially TinyT was built for fast evaluation of XPath count queries.
Later we realized its usefulness for fast serialization; the key idea
was to avoid materialization of result nodes and to print directly in
parallel with query evaluation. The running times for both counting
and printing are highly competitive, as can be seen in Figure 12. We
also wanted to compare to specialized systems such as implementations
of twig queries. Twig queries have been studied extensively
both from a theoretical and an implementation point of view. They
belong to the most highly optimized XPath queries (see, e.g., [14,
15] and the references in those articles). Twig query implementations
materialize several context nodes per query result. This is different
from XPath semantics, in which only one node is selected at a
time. Clearly it would not be fair to compare our count evaluator
with a twig implementation that materializes (even multiple nodes
per result). We decided to build a materializer for TinyT which
produces pre-order numbers of the result nodes. This was done in short
time, by essentially reusing the code of the serializer, and indeed,
doing a fair amount of serialization in memory during materialization.
Certainly, this implementation is far from optimal; it would be
much more efficient to work over node offset numbers, rather than
serialized XML tag sequences. As the experiments in Figure 16
show, TinyT is the fastest only for query X3, while for X1 and X2
XLeaf and TJStrictPre are the fastest, respectively. We believe that
a more efficient implementation of materializing over TinyT can be
considerably faster, ca. 2-3 times slower than counting.
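The remark about node offset numbers can be illustrated on the simplest
kind of grammar, a DAG. The sketch below is not the TinyT materializer
described above; it only shows the idea of computing, once per shared
subtree, its size and the relative pre-order offsets of its matching
nodes, and of shifting these offsets for every occurrence. The names
DagNode and materialize_preorder are hypothetical, and pre-order numbers
are assumed to start at 0 for the root.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class DagNode:
    """Node of a DAG; a node shared by several parents stands for
    several identical subtrees of the document (hypothetical type)."""
    label: str
    children: list = field(default_factory=list)


def materialize_preorder(root: DagNode, label: str) -> list:
    """Return the pre-order numbers of all document nodes with this label."""
    memo = {}  # id(node) -> (subtree size, offsets relative to the node)

    def visit(node):
        key = id(node)
        if key not in memo:
            offsets = [0] if node.label == label else []
            size = 1
            for child in node.children:
                child_size, child_offsets = visit(child)
                # The child's subtree begins `size` positions after node.
                offsets.extend(size + off for off in child_offsets)
                size += child_size
            memo[key] = (size, offsets)
        return memo[key]

    return visit(root)[1]
```

Each shared subtree is analysed only once, although the output itself is
necessarily as large as the number of result nodes.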

6.5 Compression Behavior 

Figure 16: Twig queries, materialization (in ms)

Our algorithms that execute directly on the grammar without
decompression, such as rule-wise XPath counting or chunk-wise XPath
serialization, both do one pass through the grammar. Thus, the
running time of these algorithms is strongly influenced by the size of
the grammar. TreeRePair generates smaller grammars than BPLEX
(about half the size, in terms of numbers of edges) [21], which
itself makes smaller grammars than DAGs [6]. Therefore, our count
and serialize XPath evaluators run fastest over grammars produced
by TreeRePair. The size of the start rhs is important too, because
access is slower and more complicated than over the recursive rules
(compare run times of OneRule with NoStartRule in Figures 8
and 9). BPLEX grammars have relatively small start rhs's, but
the problem with those grammars is the high rank of nonterminals:
often 10 or more parameters are needed in order to get small
grammars. Figure 11 shows information about the grammars used in
Section 6.1 for the recursive and iterative traversals. The underlined
number is the size of the start rhs, and the numbers below are: number
of nonterminals of rank 0, number of nonterminals of rank 1, etc.
After transformation to bCNF, DAGs have one parameter. TreeRePair
was instructed to produce grammars of rank ≤ 1 (which gives
grammars of rank 2 in bCNF). This gave the best performance
for our traversal experiments. Note that for the XPath evaluators,
TreeRePair grammars of rank 2 are optimal (which have rank 3 in
bCNF). For BPLEX we generated grammars that have 12 parameters
in their final bCNF form (only the first 5 numbers of nonterminals
are shown in the figure). To see the impact of the size of the
grammar, consider the query //listitem//keyword (which is adequate
because no skipping takes place in the start rhs) over XMark116M:
our count evaluator takes 5ms over a TreeRePair (rank 2) grammar.
In contrast, evaluating over a DAG grammar takes 28ms (the time
for BPLEX grammars is in the middle: 17ms). This is due to the
large number of parameters of BPLEX grammars: if the rank of
a grammar is high, then hashing and handling of parameter states
of a nonterminal becomes more expensive in the automaton evaluation
functions of our XPath evaluators. This can be seen best on
two grammars produced by TreeRePair for XMark11G. Both grammars
are of similar size, but one has rank 8 while the other has rank
3. Our count evaluator takes 38ms for the first grammar and only
28ms for the second. Thus, the number of parameters has a large
impact. In summary, setting the maximal rank to 2 in TreeRePair
(and thus obtaining grammars of rank 3 in bCNF form) gives the
best trade-off for our evaluators across all tested documents.
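Why the grammar size drives the running time can be seen on a DAG,
which is a grammar without parameters. The following sketch counts the
nodes with a given label without decompressing: every shared node is
processed exactly once and its count is reused for all of its
occurrences. It is a simplified stand-in for rule-wise counting, not
the automaton-based XPath evaluator; the names DagNode and count_label
are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class DagNode:
    """Node of a DAG; a shared node plays the role of a rank-0 rule."""
    label: str
    children: list = field(default_factory=list)


def count_label(root: DagNode, label: str) -> tuple:
    """Return (matches in the unfolded tree, DAG nodes processed)."""
    memo = {}  # id(node) -> number of matches in the subtree it represents

    def visit(node):
        key = id(node)
        if key not in memo:
            own = 1 if node.label == label else 0
            memo[key] = own + sum(visit(child) for child in node.children)
        return memo[key]

    total = visit(root)
    return total, len(memo)
```

For a list with three occurrences of the same listitem, each holding two
occurrences of the same keyword, the unfolded tree has ten nodes but
only three DAG nodes are processed. With parameters (rank above 0) the
per-rule work grows, which matches the effect described above.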

7. DISCUSSION 

We presented a new structural index for XML and evaluated its
performance for XPath evaluation. The index is based on a
grammar-compressed representation of the XML structure tree. For
common XML documents the corresponding indexes are minuscule.
When executing arbitrary tree algorithms over the index, a
good time-space trade-off is obtained. For certain simple XPath
tasks such as result node counting, impressive speed-ups can be
achieved. Our rudimentary XPath implementation over this index
outperforms the fastest known systems (MonetDB and Qizx), both
for counting and for serialization. We built an experimental
materializer which is competitive with state-of-the-art twig query
implementations. We believe that our system is useful for other
XPath evaluators and XML databases. It can be used for selectivity
computation of structural queries, and for fast serialization. It
will be interesting to extend our current XPath evaluators to handle
filters, and also to handle data value comparisons. For the latter,
a bottom-up evaluator as in [24] could be built, which first searches
over the text value store, and then verifies paths in the tree, in a
bottom-up way. For such queries the SXSI system [1] is highly
efficient. We do not expect to achieve faster run times with our index,
but think that run times similar to those of SXSI can be achieved.
This would be a large improvement, because the space requirement of
our index is much smaller than that of SXSI.

It would be interesting to add specialized indexes which allow
more efficient running times for simple queries, such as simple path
queries of the form $//a_1/a_2/\cdots/a_n$. Over strings, the self-index
of [7] allows finding occurrences of such queries in time logarithmic
in the number of rules of the grammar. Can their result be
generalized to the tree case? In terms of extraction (decompression)
there are new results for DAGs [3] that run in time logarithmic in the
number of edges of the DAG. Can this result be generalized from
DAGs to our SLT grammars?